Build an sqlite database from your month of data.
# This code only needs to run it once
library(tidyverse)
library(DBI)
d <- read_csv(here::here("data", "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_10.csv"))
# install.packages("RSQLite")
con <- dbConnect(RSQLite::SQLite(), "data/:flights:")
copy_to(con, d, "flights",
temporary = FALSE,
indexes = list(
c("FlightDate",
"Reporting_Airline",
"Tail_Number",
"Origin",
"Dest"
)
))
flights_db <- tbl(con, "flights")
feb1 <- flights_db %>%
filter(DayofMonth==1) %>%
select(DayofMonth, Origin, Dest, DepTime, ArrTime,
IATA_CODE_Reporting_Airline, Tail_Number) %>%
collect()
feb1
Explore your data! These tasks should be done using the dplyr interface, so that the tidy wrangling verbs can be used instead of raw SQL functions.
## # A tibble: 5 x 2
## IATA_CODE_Reporting_Airline n
## <chr> <int>
## 1 WN 116465
## 2 DL 86322
## 3 AA 81028
## 4 OO 72350
## 5 UA 55074
WN = Southwest had the most flights in February 2020.
## # A tibble: 5 x 2
## Origin n
## <chr> <int>
## 1 ATL 33643
## 2 ORD 30125
## 3 DFW 26396
## 4 DEN 22752
## 5 CLT 20569
ATL = Atlanta had the most departing flights.
ATL, Atlanta, is the busiest airport. The smallest departure delay for the month was -22 minutes, which means the flight left early, quite early. The longest delay was 896 minutes, almost a day delay. The median delay was -2, which is less than 0. That means that 50% of the flights left before scheduled.
mutate.)forcats package).Overall, there is not much difference in the median delays, and the variation in delays between carriers. Southwest has the highest median delay and Republic Airlines has the lowest median delay. Delta, which has this as the hub, has a small interquartile range in delays but a large number of early flights as well as many delayed flights.
## # A tibble: 2 x 2
## `is.na(DepDelay)` n
## <lgl> <int>
## 1 FALSE 33586
## 2 TRUE 57
df <- tbl(con, "flights") %>%
select(Tail_Number, Origin, Dest, FlightDate, DepTime) %>%
filter(!is.na(Tail_Number)) %>%
filter(!is.na(DepTime)) %>%
mutate(DepTime = as.numeric(DepTime)) %>%
arrange(Tail_Number, FlightDate, DepTime) %>%
collect()
df %>% count(Tail_Number, sort=TRUE)
aircraft <- df %>%
select(Tail_Number) %>%
distinct()
for (i in 1:nrow(aircraft)) {
d <- df %>% filter(Tail_Number == aircraft$Tail_Number[i])
if (nrow(d) > 1) {
for (j in 2:nrow(d)) {
if (d$Origin[j] != d$Dest[j-1]) {
cat(d$Tail_Number[j], d$Dest[j-1], d$Origin[j], "\n")
}
}
}
cat(i, "\n")
}
Here we are going to add a new table with airport information, and use this to make a map of flights.
airports <- read_csv(here::here("data","402312038_T_MASTER_CORD.csv")) %>%
select(-X29)
copy_to(con, airports, "airports",
temporary = FALSE
)
get_map() function in the ggmap package.## [1] "airports" "flights" "sqlite_stat1" "sqlite_stat4"
I chose Delta and Southwest. It looks a little like Delta has more of a hub system, and Southwest is more distributed, serving many more airports.
There’s not a lot to see in four big groups like this. Its an exercise in working with time. And also in ordering the four groups appropriately.